Module 1 Assignment: Corporate Risk Narratives in Form 10-K
Building a Classical and Neural NLP Pipeline for Industry Analysis
Important Reminder
- This assignment may be completed using Google Colab or AWS Academy.
- Your goal is to understand and implement the NLP pipeline, not to scale it.
- This assignment cannot realistically be run on a local machine (Mac or PC). Please use Google Colab or request AWS Academy labs.
- The assignment takes significant time to complete. Start early!
Submission Instructions
You must submit two files on Blackboard:
- Word Document (.docx)
  - Answers to all questions (Q1–Q4)
  - Business interpretation
  - Tables and figures
  - No more than 5 pages (excluding cover page and references)
  - No code in this document
- Jupyter Notebook (.ipynb)
  - Fully executable
  - Clearly organized by question (Q1–Q4)
  - Well-commented code
- You may follow the structure below for your solution notebook; feel free to adjust as needed:
- config and paths
- load and parse 10k
- clean and chunk text
- feature engineering
- predictive models
- error analysis
Objective
This assignment introduces the end-to-end NLP lifecycle using real corporate disclosures. You may use any generative AI tool, but you must still build and explain the pipeline yourself.
You will analyze 2024 Form 10-K filings to answer the following business question:
Can corporate risk language be used to understand and predict industry-level risk exposure?
This assignment establishes the cleaned corpus, labels, and baselines that will be reused in Assignments 2–5, where you will build LLM-based and RAG pipelines.
Business Context
An investment firm is reassessing sector exposure amid increasing uncertainty related to:
- regulation
- technological disruption
- operational risk
- macroeconomic volatility
Rather than relying only on numerical indicators, the firm wants to understand how companies describe risk in their own words.
Each Form 10-K filing includes a Standard Industrial Classification (SIC) code, which will be used as a high-level industry label for analysis and prediction.
Data Source
You will use SEC-provided TXT versions of 2024 Form 10-K filings, available here:
SEC 2024 Form 10-K TXT Files
- You will use all SEC-provided TXT versions of 2024 Form 10-K filings contained in the shared Google Drive folder.
- All files in the folder must be processed programmatically.
- You may later filter, subset, or group the data for analysis, but your NLP pipeline must be capable of ingesting the entire corpus.
Accessing the Files in Google Colab
Add Folder to Your Drive
- Open the link above.
- Right-click the folder → Add shortcut to Drive
- Save it in My Drive
Mount Drive in Colab
```python
from google.colab import drive
drive.mount('/content/drive')
```

Verify Files

```python
import os
os.listdir("/content/drive/MyDrive")
```

Adjust paths as needed.
1 How should the 10-K corpus be constructed for industry-level analysis?
1.1 Business Goal
- Load the raw text data from all 10-K filings
- Create a clean, reproducible corpus that supports comparison across industries.
- You may choose 7 different industries from the SEC filings.
- Justify your selection and explain what it gains you analytically.
1.2 Technical Instructions
- Load and process all Form 10-K TXT files in the shared folder:
- Do not manually select files
- Your code must iterate over the directory programmatically
- For each filing:
- Load the full TXT file
- Extract:
- Item 1A – Risk Factors
- Item 7 – MD&A
- Extract and store:
- company identifier
- SIC code
- Build a structured corpus where each record contains:
- company
- SIC code
- section name
- cleaned text
- sentence-level chunks
- Save intermediate artifacts:
- cleaned section text
- sentence-level files
- metadata tables
- For analysis and reporting only, select:
- at least 10 industries
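As a starting point, the programmatic directory iteration and section extraction might look like the sketch below. The regex is an assumption: heading formats vary widely across real 10-K TXT files, so you will need to tune it, and extraction of the company identifier and SIC code from the filing header is omitted here.

```python
import os
import re
import pandas as pd

def extract_item_1a(text):
    """Extract the text between the 'Item 1A' and 'Item 1B' headings.

    This pattern is only a starting point; heading capitalization,
    punctuation, and tables of contents differ across filings.
    """
    match = re.search(
        r"item\s+1a[\.\:\s].*?(?=item\s+1b[\.\:\s])",
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(0) if match else ""  # empty string if the section is missing

def build_corpus(folder):
    """Iterate over the folder programmatically; no manual file selection."""
    records = []
    for fname in sorted(os.listdir(folder)):
        if not fname.endswith(".txt"):
            continue
        path = os.path.join(folder, fname)
        with open(path, encoding="utf-8", errors="ignore") as f:
            raw = f.read()
        records.append({
            "file": fname,
            "item_1a": extract_item_1a(raw),
        })
    return pd.DataFrame(records)
```

The same pattern extends to Item 7 (MD&A) with a second regex, and the resulting DataFrame can be enriched with SIC codes and sentence-level chunks before saving intermediate artifacts.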
1.3 Expected Outputs
- Table: company → SIC → industry group → section length
- Folder structure showing saved artifacts
- Brief explanation of cleaning decisions
All Form 10-K files in the shared folder must be processed as part of the NLP pipeline.
For Questions 2–4, you may subset the processed corpus for analysis, visualization, and modeling.
2 How should corporate risk language be represented for NLP analysis?
All text representations must be generated for the entire corpus, even if only a subset is used for downstream modeling.
2.1 Business Goal
Determine how narrative risk disclosures should be numerically represented for analysis and prediction.
2.2 Technical Instructions
- Build a classical NLP representation:
- TF-IDF on Item 1A text
- Explain tokenization and stopword choices
- Build a neural representation:
- Sentence embeddings using a Hugging Face model
- Aggregate to document-level embeddings
- Ensure all representations retain:
- company identifier
- SIC code
- section source (Item 1A or Item 7)
- Compare representations:
- dimensionality
- sparsity
- semantic coverage
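A minimal sketch of the classical representation and the comparison metrics, using scikit-learn; the neural step is shown only as a comment because it assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (384-dimensional), either of which you may swap for another Hugging Face option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins; in the assignment these are Item 1A sections.
docs = [
    "We face regulatory risk and litigation risk in multiple jurisdictions.",
    "Supply chain disruption and cybersecurity threats may impact operations.",
    "Macroeconomic volatility and interest rate changes affect our results.",
]

# Classical representation: sparse TF-IDF matrix over the corpus.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)

# Two of the properties Q2 asks you to compare.
n_docs, n_features = X.shape
sparsity = 1.0 - X.nnz / (n_docs * n_features)
print(f"TF-IDF shape: {X.shape}, sparsity: {sparsity:.2f}")

# Neural representation (sketch): a sentence-embedding model returns a
# dense vector per sentence; averaging gives a document-level embedding.
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# doc_embeddings = model.encode(docs)  # dense array, one row per document
```

Keeping the company identifier, SIC code, and section source alongside each row of either matrix is what makes the comparison table in 2.3 straightforward to build.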
2.3 Expected Outputs
- Comparison table of representations
- Short explanation of trade-offs
3 Can risk language predict industry membership?
For modeling and evaluation, you may restrict the dataset to a subset of SIC groups with sufficient sample size.
3.1 Business Goal
Evaluate whether the way firms describe risk contains enough signal to predict industry classification.
3.2 Technical Instructions
3.2.1 Prediction Task
- Input: Item 1A risk text
- Target: SIC-based industry group (coarse-grained)
3.2.2 Model A: Classical NLP
- TF-IDF features
- Logistic Regression or Linear SVM
- Report:
  - accuracy
  - confusion matrix
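One possible shape for Model A, sketched on toy stand-in texts; in the assignment, `texts` are Item 1A sections and `labels` are coarse SIC-based industry groups.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for Item 1A text and SIC industry groups.
texts = [
    "drug trials fda approval clinical", "patent drug fda pipeline clinical",
    "deposits loans interest rate bank", "bank credit loans capital reserves",
    "drilling oil reserves gas wells", "oil gas pipeline drilling exploration",
] * 5
labels = ["pharma", "pharma", "banking", "banking", "energy", "energy"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Pipeline fits TF-IDF on the training split only, avoiding leakage.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds, labels=["banking", "energy", "pharma"])
print("accuracy:", acc)
print(cm)
```

A Linear SVM (`sklearn.svm.LinearSVC`) drops into the same pipeline unchanged.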
3.2.3 Model B: Basic Neural NLP
- Sentence/document embeddings
- Feedforward neural network (MLP)
- Report:
  - accuracy
  - comparison vs classical model
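Model B follows the same pattern with dense embeddings as input. This sketch uses scikit-learn's `MLPClassifier` and synthetic clustered vectors standing in for real document embeddings; a PyTorch MLP would work equally well.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: three "industry" clusters in embedding space.
# In the assignment, X holds the document embeddings from Q2
# (384 matches common MiniLM-style sentence embeddings).
rng = np.random.default_rng(0)
n_per_class, dim = 30, 384
centers = rng.normal(size=(3, dim))
X = np.vstack([c + 0.1 * rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat([0, 1, 2], n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A small feedforward network on top of the embeddings.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
acc = mlp.score(X_test, y_test)
print("embedding + MLP accuracy:", acc)
```

Running Model A and Model B on the same train/test split is what makes the comparison in 3.2.4 fair.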
3.2.4 Analysis
- Compare model performance
- Identify industries that are frequently misclassified
- Discuss interpretability vs performance
3.3 Expected Outputs
- Model performance table
- Confusion matrices
- Short technical discussion
4 How reliable are NLP-based risk assessments for business decision-making?
4.1 Business Goal
Assess whether these models are decision-ready for real investment use.
4.2 Technical Instructions
- Identify at least three failure cases, such as:
- boilerplate language
- ambiguous phrasing
- sentiment mismatch
- Perform manual validation:
- inspect representative sentences
- explain why models failed
- Compare:
- classical vs neural models
- strengths and weaknesses
- Write a brief executive summary:
- key findings
- limitations
- recommendations
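One way to surface failure cases is to line up test texts, true labels, and predictions from Q3 and filter for disagreements. The rows below are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical results table; in practice, build this from your Q3
# test split, keeping the raw text next to the labels.
results = pd.DataFrame({
    "text": [
        "Our business may be affected by general economic conditions.",
        "FDA approval timelines create uncertainty for our pipeline.",
        "Changes in interest rates affect our loan portfolio.",
    ],
    "true_label": ["energy", "pharma", "banking"],
    "predicted": ["banking", "pharma", "banking"],
})

# Pull the misclassified rows for manual inspection (Q4 failure cases).
errors = results[results["true_label"] != results["predicted"]]
print(errors[["text", "true_label", "predicted"]])
# Boilerplate like the first sentence could appear in any industry's
# filing, which is one of the failure modes Q4 asks you to document.
```

Reading a handful of these rows per misclassified industry is usually enough to sort failures into boilerplate, ambiguous phrasing, or sentiment mismatch.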
4.3 Expected Outputs
- Failure case table
- Executive-style memo
Use of Generative AI
Generative AI tools may be used for:
- code assistance
- explanation
- summarization
They may not replace required pipeline steps.
You must briefly document:
- where AI tools were used
- how outputs were validated or modified
Looking Ahead
In Assignments 2–5, you will:
- replace classical models with large language models
- build retrieval-augmented generation (RAG) pipelines
- use vector search and neural retrieval
The cleaned corpus, SIC labels, sentence chunks, and embeddings created here will be reused directly.
Resources
- SEC EDGAR Filings Guide
- Hugging Face Transformers Documentation
- scikit-learn NLP Documentation
- PyTorch Documentation